Skip to content

[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution#5275

Merged
nsivabalan merged 6 commits intoapache:masterfrom
onehouseinc:ak/dskp-schm-ev
Apr 11, 2022
Merged

[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution#5275
nsivabalan merged 6 commits intoapache:masterfrom
onehouseinc:ak/dskp-schm-ev

Conversation

@alexeykudinkin
Copy link
Contributor

Tips

What is the purpose of the pull request

Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI.

This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.

Brief change log

  • Added appropriate aligning for the transposed CSI projection

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).
This change added tests and can be verified as follows:

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Alexey Kudinkin added 2 commits April 10, 2022 13:12
…every column is present for every file (due to schema evolution, CSI config changes, etc)
@alexeykudinkin alexeykudinkin changed the title [HUDI-3841][Stacked on 5244] Fixing Column Stats in the presence of Schema Evolution [HUDI-3841] Fixing Column Stats in the presence of Schema Evolution Apr 10, 2022
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this patch be able to handle column drops? Particularly, if the dropped column was part of HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS? On the reader side, we pass the latest tableSchema and not the fileSchema right? On the writer side, don't we need to cleanup this config or throw an error and ask the user to reset it. I know this is slightly tangential point. But if you think there is more work to be done for handling shema evolution comprehensively, then maybe create a followup ticket.

Comment on lines +136 to +141
// NOTE: We have to collect list of indexed columns to make sure we properly align the rows
// w/in the transposed dataset: since some files might not have all of the columns indexed
// either due to the Column Stats Index config changes, schema evolution, etc, we have
// to make sure that all of the rows w/in transposed data-frame are properly padded (with null
// values) for such file-column combinations
val indexedColumns: Seq[String] = colStatsDF.rdd.map(row => row.getString(colNameOrdinal)).distinct().collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do this one level above in readColumnStatsIndex so that colStatsDF itself is correctly populated and transposeColumnStatsIndex simply transposes as today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colStatsDF is populated correctly -- it bears 1 row / column (let's call it "row-based"), therefore for all columns in a file we will have N rows corresponding to it (eq to the # of columns in that file).

Transposed table is "column-based", ie there's 1 row / file and each column's stat is mapped to a column in such view. Therefore only in that view we have a need to align the rows (to pad them).

@nsivabalan
Copy link
Contributor

In the interest of time for 0.11 release, here is my take. I haven't looked at the changes as such. but my 2 cents given the last min changes.

We can ignore scheme evolution, CSI config changes (list of columns to index) for now. We can call out that CSI configs can be set only once and cannot be changed (list of cols to index), and may not work w/ schema evolution. enable and disable should be doable, just changing the list of columns to index on the fly is not feasible.

just that we should not miss or regress any core flow by accomodating changes to support advanced use-cases like config changes and schema evolution. may be there are lot more scenarios to consider like column renaming, integrating schema evolution w/ col stats etc. So, we can take it up for 0.12.

@alexeykudinkin
Copy link
Contributor Author

@codope yes after this patch it will be able to handle it -- on the read path we're not relying on writer's config, instead we use whatever is in the Index as the source of truth and play by that (which helps us also built in resilience against any indexing progress gaps, schema evolution, etc)

@nsivabalan sorry, heading might be misleading, Schema Evolution is just one of the cases that might lead to crashes in this flow but is the one that is easier to explain. We have to bring this PR in 0.11, as it covers other critical cases as well (for ex, when query contains columns that are not indexed)

@alexeykudinkin
Copy link
Contributor Author

Also, this PR is very limited in scope and has practically 100% test coverage. I see no risk in this PR landing.

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Discussed offline. Drop columns can be handled with this patch. Essentially, this patch is not about schema evolution. It is about difference in what columns have been indexed and what's being queried. Schema evolution is one use case where it would be helpful. But, to cover schema evolution comprehensively including renames, we need some more work. However, that should not block this patch from landing.

@nsivabalan nsivabalan merged commit 458fdd5 into apache:master Apr 11, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022
…5275)

Currently, Data Skipping is not handling correctly the case when column-stats are not aligned and, for ex, some of the (column, file) combinations are missing from the CSI.

This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.

- Added appropriate aligning for the transposed CSI projection
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants